Keyword [Ensemble Adversarial Training]
Tramèr F., Kurakin A., Papernot N., et al. Ensemble Adversarial Training: Attacks and Defenses. arXiv preprint arXiv:1705.07204, 2017.
1. Overview
In this paper, the authors:
- show that adversarial training converges to a degenerate global minimum
- find that adversarially trained models remain vulnerable to black-box attacks, in which perturbations computed on an undefended model are transferred, as well as to a new randomized single-step attack
- introduce Ensemble Adversarial Training, which augments training data with perturbations transferred from other models (decoupling adversarial example generation from the parameters of the trained model)
- show that Ensemble Adversarial Training yields models with strong robustness to black-box attacks
1.1. Adversarial Training
- adversarial training on MNIST yields models that are robust to white-box attacks
- the MNIST dataset is peculiar in that there exists a simple 'closed-form' denoising procedure (namely feature binarization) which leads to similarly robust models without adversarial training; this may explain why robustness to white-box attacks has been hard to scale to tasks such as ImageNet
- for an average MNIST image, over 80% of the pixels are in {0, 1} and only 6% are in the range [0.2, 0.8]; thus, for a perturbation with ε ≤ 0.3, the binarized versions of x and x_adv can differ in at most 6% of the input dimensions
- some prior work has hinted that adversarially trained models may remain vulnerable to black-box attacks
- e.g., an adversarially trained maxout network on MNIST has slightly higher error on transferred examples than on white-box examples
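The feature-binarization argument above can be sketched in a few lines; the pixel values below are illustrative, not taken from the paper:

```python
import numpy as np

def binarize(x, threshold=0.5):
    """Feature binarization: round each pixel to 0 or 1.
    Acts as a simple 'closed-form' denoiser on near-binary MNIST inputs."""
    return (x >= threshold).astype(np.float32)

# Hypothetical example: pixels clustered at {0, 1} survive an eps = 0.3 shift.
x = np.array([0.0, 1.0, 0.95, 0.05])                            # typical MNIST pixels
x_adv = np.clip(x + 0.3 * np.array([1, -1, -1, 1]), 0.0, 1.0)   # worst-case perturbation

print(binarize(x))      # [0. 1. 1. 0.]
print(binarize(x_adv))  # [0. 1. 1. 0.]  -- the perturbation is denoised away
```

Only pixels already inside [0.2, 0.8] can cross the 0.5 threshold under an ε ≤ 0.3 perturbation, which is why binarization recovers the clean image on all but ~6% of dimensions.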
1.2. Attack Methods
- FGSM: a single signed-gradient step that increases the loss on the true label
- Single-Step Least-Likely Class Method (Step-LL): a targeted variant of FGSM that steps toward the least-likely predicted class; the most effective variant for adversarial training on ImageNet
- Iter-FGSM and Iter Step-LL: iterative variants that apply the single-step attack repeatedly with a smaller step size
- proposed randomized single-step attack (R+FGSM / R+Step-LL): prepend a small random step to escape the sharp loss-surface artifacts that adversarial training introduces near data points
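The single-step attacks can be sketched as follows; the linear softmax model (`W`, `b`) is a toy stand-in for a real network, not anything from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
W, b = rng.standard_normal((3, 5)), np.zeros(3)   # toy model: 3 classes, 5 features

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss_grad(x, y):
    """Gradient of the cross-entropy loss w.r.t. the input x."""
    p = softmax(W @ x + b)
    p[y] -= 1.0
    return W.T @ p

def fgsm(x, y, eps):
    # FGSM: one signed-gradient step that increases the loss on the true label y
    return np.clip(x + eps * np.sign(loss_grad(x, y)), 0.0, 1.0)

def step_ll(x, eps):
    # Step-LL: targeted step that *decreases* the loss on the least-likely class
    y_ll = int(np.argmin(softmax(W @ x + b)))
    return np.clip(x - eps * np.sign(loss_grad(x, y_ll)), 0.0, 1.0)

def r_fgsm(x, y, eps, alpha):
    # R+FGSM: random signed step of size alpha, then FGSM with the remaining
    # budget (eps - alpha); the random step moves off the sharp local artifact
    x_rand = x + alpha * np.sign(rng.standard_normal(x.shape))
    return np.clip(x_rand + (eps - alpha) * np.sign(loss_grad(x_rand, y)), 0.0, 1.0)
```

The iterative variants simply loop `fgsm`/`step_ll` with a smaller per-step size, re-clipping into the ε-ball each iteration.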
1.3. Ensemble Adversarial Training
- decouples the generation of adversarial examples from the model being trained
- augments training data with adversarial examples crafted on other static pre-trained models
1.4. Experiments
- adversarial training greatly increases robustness to white-box single-step attacks, but incurs a higher error rate in a black-box setting
1.4.1. Ensemble Training
- Ensemble Adversarial Training is not robust to white-box Iter-LL and R+Step-LL samples